Take-Home Exercise 3

Creating a visualisation to show the average rating and the proportion of cocoa percent (% chocolate) greater than or equal to 70% for the top 15 company locations.

M.L. Kwong (https://scis.smu.edu.sg/master-it-business), MITB (Analytics) (https://scis.smu.edu.sg/)
2022-02-20

1.0 Overview

In this take-home exercise, we apply appropriate data visualisation techniques, using ggplot2, to show the average rating and the proportion of cocoa percent (% chocolate) greater than or equal to 70% for the top 15 company locations.

2.0 Data Import

The chocolate.csv dataset was used to show the average rating and the proportion of cocoa percent (% chocolate) greater than or equal to 70% for the top 15 company locations.

The code chunk below was used to import the necessary packages to create the visualisation:

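packages = c('tidyverse','plotly','crosstalk')
for (p in packages){
  # install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  # load the package (also covers the freshly installed case)
  library(p, character.only = TRUE)
}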

3.0 Data Preparation

Step 1: Isolate the columns needed (i.e. company_location, rating and cocoa_percent).
Step 2: Remove “%” from cocoa_percent and convert to numeric.

choco <- read_csv("data/chocolate.csv")

choco$cocoa_percent <- as.numeric(gsub(pattern = "%", replacement = "", x = choco$cocoa_percent))

##subsetting the isolated columns

chocodf <- choco %>% select(company_location, rating, cocoa_percent)

##convert rating to numeric

chocodf$rating <- as.numeric(chocodf$rating)
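The preparation steps above can be tried on a few rows before running them on the full file. The sketch below is only an illustration: the toy data frame and the name toy_clean are hypothetical and assume that chocolate.csv stores cocoa_percent as text such as "70%".

```r
library(dplyr)

# Hypothetical rows standing in for chocolate.csv (not real data)
toy <- data.frame(
  company_location = c("U.S.A.", "France", "U.S.A."),
  rating = c("3.25", "3.5", "2.75"),
  cocoa_percent = c("70%", "75%", "63%"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  # keep only the columns needed
  select(company_location, rating, cocoa_percent) %>%
  mutate(
    # strip the literal "%" and convert to a number in percent units
    cocoa_percent = as.numeric(gsub("%", "", cocoa_percent, fixed = TRUE)),
    rating = as.numeric(rating)
  )
```

After this step cocoa_percent holds values such as 70 and 75 (percent units), not fractions such as 0.70.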

3.1 Average Rating

  1. Create avg_rating by grouping the data by company location and summarising it to get the frequency count, mean and standard deviation of rating
  2. Pipe the output with “%>%” into “mutate” to create a new variable for the standard error (SE = standard deviation / sqrt(n - 1))
  3. Sort the result by frequency in descending order and keep the top 15 company locations

avg_rating <- chocodf %>%
  group_by(company_location) %>%
  summarise(
    n=n(),
    mean=mean(rating),
    sd=sd(rating)
    ) %>%
  mutate(se=sd/sqrt(n-1))

avg_rating_top15 <- avg_rating %>% arrange(desc(n)) %>% slice(1:15)
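To make the summary columns concrete, here is a minimal sketch of the same group, summarise and rank pattern on hypothetical toy ratings. It reuses the SE formula stated in this exercise and keeps one location instead of 15; the data and the names toy_summary and toy_top are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical ratings for two locations (not real data)
toy <- data.frame(
  company_location = c("A", "A", "A", "B", "B"),
  rating = c(3, 3.5, 4, 2, 3)
)

toy_summary <- toy %>%
  group_by(company_location) %>%
  summarise(
    n = n(),              # frequency count per location
    mean = mean(rating),  # average rating
    sd = sd(rating)       # sample standard deviation
  ) %>%
  mutate(se = sd / sqrt(n - 1))  # SE formula as used in this exercise

# keep the most frequent location (the exercise keeps the top 15)
toy_top <- toy_summary %>% arrange(desc(n)) %>% slice(1)
```

Location A has three ratings and location B has two, so toy_top contains the row for A.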

3.2 Cocoa Percentage (%)

  1. Filter the dataset to rows with cocoa percentage >= 70%
  2. Create avg_percent by grouping the data by company location and summarising it to get the frequency count, mean and standard deviation of cocoa_percent
  3. Pipe the output with “%>%” into “mutate” to create a new variable for the standard error (SE = standard deviation / sqrt(n - 1))
  4. Sort the result by frequency in descending order and keep the top 15 company locations

avg_percent <- chocodf %>%
  # cocoa_percent is in percent units (e.g. 70) after removing "%", not a fraction
  filter(cocoa_percent >= 70) %>%
  group_by(company_location) %>%
  summarise(
    n=n(),
    mean=mean(cocoa_percent),
    sd=sd(cocoa_percent)
    ) %>%
  mutate(se=sd/sqrt(n-1))

avg_percent_top15 <- avg_percent %>% arrange(desc(n)) %>% slice(1:15)
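The code above averages cocoa_percent among the bars at or above 70%. If the proportion of bars at or above 70% per location is wanted instead, one possible sketch is shown below. This is an assumption about how the stated aim could be read, not the method used in this exercise; the toy data and the names prop_70 and prop_ge_70 are hypothetical.

```r
library(dplyr)

# Hypothetical cocoa percentages in percent units (not real data)
toy <- data.frame(
  company_location = c("A", "A", "A", "A", "B", "B"),
  cocoa_percent = c(70, 75, 65, 80, 60, 72)
)

prop_70 <- toy %>%
  group_by(company_location) %>%
  summarise(
    n = n(),
    # the mean of a logical vector is the share of TRUE values
    prop_ge_70 = mean(cocoa_percent >= 70)
  ) %>%
  arrange(desc(n))
```

Here three of the four bars from A and one of the two bars from B reach the threshold.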

4.0 Creating the Visualisation

4.1 Average Rating by Top 15 Company Locations (According to Frequency)

ggplot(avg_rating_top15,
       # order locations by descending frequency; shared by both layers
       aes(x = reorder(company_location, -n))) +
  # error bars spanning mean +/- 1.98 * se
  geom_errorbar(
    aes(ymin = mean - 1.98*se,
        ymax = mean + 1.98*se),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    size = 0.5) +
  # red point marking the mean rating
  geom_point(
    aes(y = mean),
    colour = "red",
    size = 1.5,
    alpha = 1) +
  xlab("Company Location") +
  ylab("Average Rating") +
  ggtitle("Standard error of mean rating of top 15 companies (based on frequency)") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))
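The error bars in this plot span the mean plus or minus 1.98 times the standard error. As a small sketch, the same bounds can be computed as explicit columns to inspect the numbers behind the bars; the toy summary rows and the column names lower and upper are hypothetical.

```r
library(dplyr)

# Hypothetical summary rows in the same shape as avg_rating_top15
toy_summary <- data.frame(
  company_location = c("A", "B"),
  mean = c(3.5, 3.0),
  se = c(0.1, 0.2)
)

with_bounds <- toy_summary %>%
  mutate(
    lower = mean - 1.98 * se,  # value mapped to ymin in geom_errorbar
    upper = mean + 1.98 * se   # value mapped to ymax in geom_errorbar
  )
```

Precomputing the bounds this way also makes it possible to map ymin = lower and ymax = upper directly.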

4.2 Average Cocoa Percentage by Top 15 Company Locations (According to Frequency)

ggplot(avg_percent_top15,
       # order locations by descending frequency; shared by both layers
       aes(x = reorder(company_location, -n))) +
  # error bars spanning mean +/- 1.98 * se
  geom_errorbar(
    aes(ymin = mean - 1.98*se,
        ymax = mean + 1.98*se),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    size = 0.5) +
  # red point marking the mean cocoa percentage
  geom_point(
    aes(y = mean),
    colour = "red",
    size = 1.5,
    alpha = 1) +
  xlab("Company Location") +
  ylab("Average Cocoa Percentage (%)") +
  ggtitle("Standard error of mean cocoa percentage of top 15 companies (based on frequency)") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

4.3 Combining the Two Graphs Using plotly and crosstalk

We create an interactive view that places the two plots together so that they can be compared directly to identify trends.

The code chunk below does a left join of the two datasets avg_rating_top15 and avg_percent_top15 to create a single dataset for the visualisation. The merge() function is used with all.x = TRUE.

Subsequently, the crosstalk method was used to link the two graphs together.

##combining the two datasets

forggplotly <- merge(x=avg_rating_top15, y = avg_percent_top15, by = "company_location", all.x =TRUE)
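The merged dataset carries the columns of both inputs, which is where names such as n.x and mean.y come from. The base R sketch below illustrates this on hypothetical toy summaries: merge() appends the suffixes .x and .y to columns that share a name, and all.x = TRUE keeps every row of the first table, filling unmatched rows with NA.

```r
# Hypothetical summaries (not real data); location B has no cocoa summary
rating <- data.frame(
  company_location = c("A", "B", "C"),
  n = c(10, 8, 5),
  mean = c(3.2, 3.0, 3.4)
)
percent <- data.frame(
  company_location = c("A", "C", "D"),
  n = c(7, 4, 3),
  mean = c(72, 75, 71)
)

# left join: keep all rows of rating, match percent where possible
joined <- merge(x = rating, y = percent, by = "company_location", all.x = TRUE)
```

In joined, the .x columns come from rating and the .y columns from percent. Location B is kept with NA in n.y and mean.y, and location D is dropped because it appears only in the second table. The same can happen in forggplotly when a location is in the top 15 of one summary but not the other.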

4.3.1 Challenges Faced

  1. Overlapping x-axis labels, which were fixed manually by rotating them with “theme(axis.text.x = element_text(angle = 45, size = 10))”
  2. Initially, we tried to use subplots; however, this gave less flexibility in having two different plot titles. As such, the crosstalk method was more appropriate, with the manual theme configuration above applying an angle to the x-axis labels.

d <- highlight_key(forggplotly)

# columns with suffix .x come from the rating summary, suffix .y from the cocoa percentage summary

p1 <- ggplot(d,
             # order locations by descending rating frequency
             aes(x = reorder(company_location, -n.x))) +
  # error bars spanning mean +/- 1.98 * se
  geom_errorbar(
    aes(ymin = mean.x - 1.98*se.x,
        ymax = mean.x + 1.98*se.x),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    size = 0.5) +
  # red point marking the mean rating
  geom_point(
    aes(y = mean.x),
    colour = "red",
    size = 1.5,
    alpha = 1) +
  xlab("Company Location") +
  ylab("Average Rating") +
  theme(axis.text.x = element_text(angle = 45, size = 10)) +
  ggtitle("Standard error of mean rating of top 15 companies (based on frequency)")

p2 <- ggplot(d,
             # order locations by descending frequency of bars with >= 70% cocoa
             aes(x = reorder(company_location, -n.y))) +
  # error bars spanning mean +/- 1.98 * se
  geom_errorbar(
    aes(ymin = mean.y - 1.98*se.y,
        ymax = mean.y + 1.98*se.y),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    size = 0.5) +
  # red point marking the mean cocoa percentage
  geom_point(
    aes(y = mean.y),
    colour = "red",
    size = 1.5,
    alpha = 1) +
  xlab("Company Location") +
  ylab("Average Cocoa Percentage (%)") +
  theme(axis.text.x = element_text(angle = 45, size = 10)) +
  # "\n" breaks the title without embedding stray indentation in the string
  ggtitle("Standard error of mean cocoa percentage of top 15 companies\n(based on frequency)")

gg1 <- ggplotly(p1)
gg2 <- ggplotly(p2)


crosstalk::bscols(gg1,
                  gg2,
                  widths = 12)
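crosstalk::bscols() lays widgets out on a 12-column grid and recycles the widths vector, so widths = 12 gives each plot a full row and stacks them. As a hedged sketch of the alternative, the example below places two linked plots side by side with widths = c(6, 6). It uses the built-in mtcars data, not the chocolate data, and the object names are hypothetical.

```r
library(ggplot2)
library(plotly)
library(crosstalk)

# shared data object so that both plots refer to the same rows
shared <- highlight_key(mtcars)

p_a <- ggplot(shared, aes(x = wt, y = mpg)) + geom_point()
p_b <- ggplot(shared, aes(x = hp, y = mpg)) + geom_point()

gg_a <- ggplotly(p_a)
gg_b <- ggplotly(p_b)

# six grid columns each: the two plots share one row
linked <- bscols(gg_a, gg_b, widths = c(6, 6))
```

Printing linked in an R Markdown document or the RStudio viewer renders the two plots. Side-by-side placement halves the width available to each plot, which may make the x-axis label overlap described above more pronounced.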

5.0 Findings

Top Company

Average Rating

Average Cocoa Percentage (%)

6.0 References